This report explores a dataset containing the video id, views count, number of likes, subscriber count and other attributes for approximately 4,500 YouTube trending videos.
## [1] 4547 23
## 'data.frame': 4547 obs. of 23 variables:
## $ video_id : Factor w/ 4547 levels "-_jlqATo9eo",..: 322 281 548 3186 1308 1796 374 2794 2271 3749 ...
## $ last_trending_date : Factor w/ 110 levels "2017-11-14","2017-11-15",..: 7 7 7 7 6 7 5 6 2 2 ...
## $ publish_date : Factor w/ 211 levels "2006-07-23","2008-04-05",..: 100 100 99 100 99 100 99 99 100 100 ...
## $ publish_hour : Factor w/ 24 levels "0","1","2","3",..: 18 8 20 12 19 20 6 22 15 14 ...
## $ category_id : Factor w/ 16 levels "1","2","10","15",..: 8 10 9 10 10 14 10 14 1 11 ...
## $ channel_title : Factor w/ 1905 levels "12 News","1MILLION Dance Studio",..: 274 938 1418 645 1215 751 1443 393 4 1834 ...
## $ views : int 2564903 6109402 5315471 913268 2819118 1038365 2688797 1251577 2671756 635985 ...
## $ likes : int 96321 151250 187303 16729 153395 22594 19042 28951 12699 20721 ...
## $ dislikes : int 7972 11508 7278 1386 2416 2798 3059 1146 505 2417 ...
## $ comment_count : int 22149 19820 9990 2988 20573 3142 2689 2606 1010 4111 ...
## $ comments_disabled : Factor w/ 2 levels "False","True": 1 1 1 1 1 1 1 1 1 1 ...
## $ ratings_disabled : Factor w/ 2 levels "False","True": 1 1 1 1 1 1 1 1 1 1 ...
## $ tag_appeared_in_title_count: int 0 0 8 3 1 2 4 4 6 2 ...
## $ tag_appeared_in_title : logi FALSE FALSE TRUE TRUE TRUE TRUE ...
## $ title : Factor w/ 4540 levels "'Big one' knocks out several heavy-hitters, sends Daytona 500 to OT",..: 4264 3905 3167 2891 1854 107 3298 182 3777 4445 ...
## $ tags : Factor w/ 4190 levels "#MeToo|Grammys 2018|Janelle Monáe|Kesha",..: 3195 2058 2933 3042 3092 1654 3286 45 3827 4001 ...
## $ description : Factor w/ 4416 levels "","'A curious cat helps his owner with home improvements.'\\nWe're releasing a NEW BLACK & WHITE episode every wee"| __truncated__,..: 3136 2766 4111 3965 1675 3445 1024 1734 1897 1187 ...
## $ trend_day_count : int 7 7 7 7 6 7 5 6 2 2 ...
## $ trend.publish.diff : int 7 7 8 7 7 7 6 7 2 2 ...
## $ trend_tag_highest : int 2 65 68 488 488 38 488 113 151 39 ...
## $ trend_tag_total : int 2 69 426 1246 1007 122 2216 180 458 170 ...
## $ tags_count : int 1 4 23 28 14 7 42 13 28 20 ...
## $ subscriber : int 9086142 5937292 4191209 13186408 20563106 4652602 5292034 10474796 2453494 3808198 ...
## video_id last_trending_date publish_date publish_hour
## -_jlqATo9eo: 1 2018-03-05: 200 2018-02-05: 71 17 : 408
## -0NYY8cqdiQ: 1 2018-01-09: 141 2017-12-13: 70 16 : 404
## -1yT-K3c6YI: 1 2018-02-01: 85 2017-12-12: 67 15 : 342
## -2b4qSoMnKE: 1 2017-12-13: 70 2018-01-29: 66 18 : 324
## -2RVw2_QyxQ: 1 2017-11-14: 69 2017-11-15: 62 14 : 319
## -2wRFv-mScQ: 1 2017-11-22: 68 2018-01-26: 61 20 : 246
## (Other) :4541 (Other) :3914 (Other) :4150 (Other):2504
## category_id channel_title
## 24 :1102 The Tonight Show Starring Jimmy Fallon: 49
## 10 : 568 TheEllenShow : 44
## 25 : 436 ESPN : 41
## 26 : 413 Netflix : 41
## 23 : 380 Jimmy Kimmel Live : 39
## 22 : 352 Refinery29 : 39
## (Other):1296 (Other) :4294
## views likes dislikes comment_count
## Min. : 559 Min. : 0 Min. : 0 Min. : 0
## 1st Qu.: 90896 1st Qu.: 1486 1st Qu.: 76 1st Qu.: 226
## Median : 318840 Median : 7397 Median : 291 Median : 854
## Mean : 1265665 Mean : 39197 Mean : 2617 Mean : 4939
## 3rd Qu.: 1006673 3rd Qu.: 25576 3rd Qu.: 1023 3rd Qu.: 2862
## Max. :149376127 Max. :3093544 Max. :1674420 Max. :1361580
##
## comments_disabled ratings_disabled tag_appeared_in_title_count
## False:4471 False:4522 Min. : 0.000
## True : 76 True : 25 1st Qu.: 1.000
## Median : 3.000
## Mean : 2.961
## 3rd Qu.: 4.000
## Max. :18.000
##
## tag_appeared_in_title
## Mode :logical
## FALSE:701
## TRUE :3846
##
##
##
##
## title
## DORITOS BLAZE vs. MTN DEW ICE | Super Bowl Commercial with Peter Dinklage and Morgan Freeman: 2
## Justice League - Movie Review : 2
## Maroon 5 - Wait : 2
## Missouri Star Quilt Company Live Stream : 2
## NBA Bloopers - The Starters : 2
## Selena Gomez, Marshmello - Wolves : 2
## (Other) :4535
## tags
## The Late Show|Stephen Colbert|Colbert|Late Show|celebrities|late night|talk show|skits|bit|monologue|The Late Late Show|Late Late Show|letterman|david letterman|comedian|impressions|CBS|joke|jokes|funny|funny video|funny videos|humor|celebrity|celeb|hollywood|famous|James Corden|Corden|Comedy: 25
## James Corden|The Late Late Show|Colbert|late night|late night show|Stephen Colbert|Comedy|monologue|comedian|impressions|celebrities|carpool|karaoke|CBS|Late Late Show|Corden|joke|jokes|funny|funny video|funny videos|humor|celebrity|celeb|hollywood|famous : 23
## Viral|Video|Epic : 11
## cupcakes|how to make vanilla cupcakes|over the top recipes|easy cupcake recipes|vanilla cupcakes|chocolate cupcakes|french macarons|how to make macarons|the scran line|the scranline|nick makrides|pastry design|how to pipe cupcakes : 7
## nba|basketball|starters : 7
## (Other) :4266
## NA's : 208
## description
## : 89
## Jukin Media Verified (Original) * For licensing / permission to use: Contact - licensing(at)jukinmediadotcom\\nSubmit your videos here: http://bit.ly/2iFnUya : 11
## ► Listen LIVE: http://power1051fm.com/\\n► Facebook: https://www.facebook.com/Power1051NY/\\n► Twitter: https://twitter.com/power1051/\\n► Instagram: https://www.instagram.com/power1051/ : 10
## : 4
## To get this complete recipe with instructions and measurements, check out my website: http://www.LauraintheKitchen.com\\n\\nInstagram: http://www.instagram.com/mrsvitale\\n\\nOfficial Facebook Page: http://www.facebook.com/LauraintheKitchen\\n\\nContact: Business@LauraintheKitchen.com\\n\\nTwitter: @Lauraskitchen : 4
## Get Cut swag here: http://cut.com/shop\\n\\nDon't forget to subscribe and follow us!\\nYouTube: http://cut.com/youtube \\nFacebook: http://cut.com/facebook \\nInstagram: http://cut.com/instagram \\nSnapchat: @watchcut\\n\\nProduced, directed, and edited by https://cut.com \\n\\nWant to work with us? http://cut.com/hiring \\nWant to be in a video? http://cut.com/casting \\nLove Cut? Fill out this form for exclusive updates: http://cut.com/fanform \\n\\nWant to sponsor a video? http://cut.com/sponsorships \\nFor licensing inquiries: http://cut.com/licensing: 3
## (Other) :4426
## trend_day_count trend.publish.diff trend_tag_highest trend_tag_total
## Min. : 1.000 Min. : 0.00 Min. : 0.0 Min. : 0.0
## 1st Qu.: 3.000 1st Qu.: 5.00 1st Qu.: 22.0 1st Qu.: 68.0
## Median : 5.000 Median : 6.00 Median : 85.0 Median : 217.0
## Mean : 4.831 Mean : 34.43 Mean :130.3 Mean : 437.9
## 3rd Qu.: 7.000 3rd Qu.: 7.00 3rd Qu.:151.0 3rd Qu.: 515.0
## Max. :14.000 Max. :4215.00 Max. :488.0 Max. :3644.0
##
## tags_count subscriber
## Min. : 0.00 Min. : 0
## 1st Qu.: 9.00 1st Qu.: 246647
## Median :18.00 Median : 1198769
## Mean :19.21 Mean : 3164303
## 3rd Qu.:29.00 3rd Qu.: 3766915
## Max. :69.00 Max. :28676937
## NA's :22
Our dataset consists of 23 variables, with 4547 observations.
[Note: I will create one more variable in the Bivariate Plots section.]
## Time difference of 105 days
So the observations of YouTube trending videos span a total of 105 days.
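As a minimal sketch of how such a span could be computed, assuming the dates are stored as a factor of "YYYY-MM-DD" strings (as `str()` shows for `last_trending_date`). The vector `trending_dates` below is hypothetical toy data, not the real column.

```r
# Hypothetical toy stand-in for a factor column of "YYYY-MM-DD" strings.
trending_dates <- factor(c("2018-01-03", "2018-01-01", "2018-01-11"))

# Convert factor -> character -> Date, then subtract the earliest date
# from the latest one. The result is a difftime object.
dates <- as.Date(as.character(trending_dates))
span <- max(dates) - min(dates)
print(span)
```

Subtracting two `Date` values returns a `difftime`, which is why the console output reads "Time difference of ... days".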
I can see that the peak hours for publishing a trending video on YouTube are between 14:00 and 18:00 (USA time zone). It might be the case, though, that most videos are published during these hours, not only the trending ones.
Distribution of trending videos by YouTube category:
##
## 1 2 10 15 17 19 20 22 23 24 25 26 27 28 29
## 228 66 568 113 306 49 53 352 380 1102 436 413 175 291 13
## 43
## 2
It can be clearly seen that category_id = 24 is the category with the highest number (1102) of YouTube trending videos.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 246647 1198769 3164303 3766915 28676937 22
There are 22 trending videos whose subscriber information is NA.
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 208 51 49 94 141 136 113 160 153 130 152 144 117 134 108 124 108 103
## 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
## 95 116 104 98 109 109 105 89 123 110 97 84 143 97 90 97 93 81
## 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53
## 80 61 65 37 32 25 28 23 22 16 12 11 8 12 9 13 2 3
## 54 55 56 57 58 59 61 62 63 65 69
## 2 3 2 4 2 1 2 4 1 1 1
We can see that 208 videos do not use any tags.
# which.max() returns the index of the largest count in the table built from
# the 'tags_count' column. Indexing the table with it gives the mode of the
# column together with its frequency.
table(YtUsa$tags_count)[which.max(table(YtUsa$tags_count))]
## 0
## 208
The mode of tags_count is 0.
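The same idea can be wrapped in a small helper so the table is built only once. This is a sketch on hypothetical toy data; `stat_mode` and `toy_tags_count` are illustrative names, not part of the original analysis.

```r
# Return the most frequent value of x together with its frequency.
# which.max() picks the first maximum, so ties resolve to the smallest value.
stat_mode <- function(x) {
  counts <- table(x)
  counts[which.max(counts)]
}

# Hypothetical toy data standing in for YtUsa$tags_count.
toy_tags_count <- c(0, 0, 0, 5, 5, 18, 23)
toy_mode <- stat_mode(toy_tags_count)
print(toy_mode)
```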
## [1] 18
Median tags_count is 18.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 3.000 2.961 4.000 18.000
From the histogram & the summary, we can see that the interquartile range runs from Q1 = 1 to Q3 = 4, and that about 68% of the YouTube trending videos use 1 to 4 tags that also appear in the video title.
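Because the counts are discrete, the share of videos lying between Q1 and Q3 can be well above 50%. A minimal sketch of how such a share could be computed, using a hypothetical toy vector instead of the real `tag_appeared_in_title_count` column:

```r
# Hypothetical toy counts, not the real data.
toy_count <- c(0, 1, 1, 2, 3, 3, 4, 4, 6, 9)

# Share of observations whose count lies between 1 and 4 (inclusive).
# mean() of a logical vector is the proportion of TRUE values.
share_1_to_4 <- mean(toy_count >= 1 & toy_count <= 4)
print(share_1_to_4)  # 7 of the 10 toy values fall in the range
```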
## Mode FALSE TRUE
## logical 701 3846
Out of 4547 trending videos, 701 did not use any of their tags (keywords) in the video title.
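The counts from the summary above can be turned into percentages as follows. The sketch only uses the two counts printed above (701 and 3846); the variable names are illustrative.

```r
# Counts taken from the summary of tag_appeared_in_title printed above.
title_tag_counts <- c("FALSE" = 701, "TRUE" = 3846)

# Percentage of trending videos in each group, rounded to one decimal.
shares <- round(100 * title_tag_counts / sum(title_tag_counts), 1)
print(shares)  # FALSE: 15.4, TRUE: 84.6
```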
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 5.000 4.831 7.000 14.000
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 604 455 484 480 611 698 602 279 153 63 53 46 18 1
Over the 105 days of observation, a single video trended for at most 14 days.
We can see that 604 videos appeared in the YouTube trending list only once.
The long-tailed distribution is transformed for a better understanding of the comment counts. Since we use a log10 transformation on the x-axis for comment_count, we plot comment_count + 1 to avoid infinite values (log10(0) = -Inf).
## False True
## 4471 76
There are only 76 videos where the comments_disabled variable is set to True.
We can see that most YouTube videos are listed on trending within 0 to 14 days of the video publishing date.
Two days of trending data are missing in mid-January.
We can clearly see a multi-modal distribution for trend_tag_highest, while the trend_tag_total distribution is long-tailed and positively (right) skewed. Neither distribution is symmetric.
log10() is applied to the x scale of both plots.
Now I am plotting the distributions of some key features of the dataset with a log10 transformation.
There are 4,547 unique video ids (observations) in the dataset with 23 features. 15 of them are independent features & the rest are dependent features. “video_id” is the unique identifier of the dataset.
The trending report was recorded over 105 days. Category id 24 is the most common category for trending videos: 1,102 of 4,547 videos were published under it. Over the 105 days of observation, no video appeared on the trending list for more than 14 days. Most videos are listed on the YouTube trending list within 0 to 14 days of their publishing date. About 85% (3,846 of 4,547) of trending videos use at least one of their tags in the video title.
The median views for a YouTube trending video = 318840. The median comment count for a YouTube trending video = 854. The maximum number of dislikes a YouTube trending video received = 1674420. The median subscriber count for a YouTube trending video = 1198769. 208 trending videos did not include any tags. 604 videos trended only once, meaning they never re-trended within the 105-day period.
The main features in the dataset are views, comment_count, likes, dislikes & subscriber. I would like to determine which features are best for predicting the views of a YouTube trending video. I suspect comment_count, likes, dislikes, subscriber and some combination of the other features could be used to build a predictive model of the views count of a YouTube trending video.
Other features in the dataset are: category_id, tag_appeared_in_title, trend_tag_highest (the maximum number of times any single tag of the video is used across all trending videos), trend_tag_total (the total number of times the tags of the video are used across all trending videos), trend_day_count (the number of times a video was listed on YouTube trending).
Yes, I created 5 new variables derived from the dataset: tag_appeared_in_title_count, tag_appeared_in_title, trend_tag_highest, trend_tag_total and tags_count.
When I investigated, I found only 76 observations where comments_disabled is True and only 25 observations where ratings_disabled is True. These numbers are very low relative to the total number of observations. I think these two are important features of the dataset, but because they occur so rarely we cannot confidently draw conclusions from them.
Yes, I performed operations on the original dataset to make it tidy, so this dataset is a modified version of the original one. In the original dataset there were multiple observations for the same video_id. The original dataset could be analysed as a time series, but as the feature “trend_day_count” shows, not every video_id was repeated in YouTube trending, so a time series is not available for every video. In that case I would have had to filter trend_day_count > 1, which would remove 604 trending videos. For this project, however, I wanted to observe each & every trending video.
One interesting point I would like to share: the correlation between likes & comment_count = 0.71, and the correlation between dislikes & comment_count = 0.83. This suggests that more people join the conversation when they dislike a video than when they like it. In many of these cases the video might be controversial, fake news, etc.
## [1] 0.8209508
From the above plot we can see a very strong relationship between views & likes. The correlation between them is 0.82.
Since log10 is applied to the x-axis and a few videos in the YouTube trending list have 0 likes, we have to plot (likes + 1) instead of likes on the scale_x_log10() axis. That avoids infinite values (since log10(0) = -Inf).
Therefore, x = 1 on the x-axis of the above plot represents 0 likes.
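A minimal base R sketch of why the shift by 1 is needed. The vector `toy_likes` is hypothetical; the real plots use ggplot2, which is not required to see the effect.

```r
# Hypothetical toy like counts, including a video with 0 likes.
toy_likes <- c(0, 9, 99, 999)

raw_log <- log10(toy_likes)          # first element is -Inf
shifted_log <- log10(toy_likes + 1)  # 0 1 2 3, all finite

print(raw_log)
print(shifted_log)
```

After the shift, a video with 0 likes sits at log10(1) = 0, which corresponds to the tick labelled 1 on the log-scaled axis.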
We can see many outliers on the y-axis at x = 1. Many of those video authors might have disabled video ratings, so users cannot like or dislike the video.
Another point to note: beyond 10^4 = 10000 likes, the variance of likes decreases as views increase.
The plot looks almost the same as the views v/s likes plot, but in this case the variance of dislikes is larger in some places.
## [1] 0.5289388
Correlation between views & dislikes = 0.53 .
## [1] -0.02158679
The correlation between views and the likes/dislikes ratio is very weak (-0.02).
## [1] 0.7128881
The correlation between views and comment_count is strong.
Trending videos with a description length below 500 have a lower mean views count than the others. [Note: only data up to the 95th percentile are considered.]
In the above plot, the linear regression line shows that as the average video title length increases, the average views count decreases slightly.
tag_appeared_in_title_count does not affect the views count much. [Note: only the central 95% of the data (two-tailed) is considered.]
## [1] 0.02458608
As expected, the correlation between views & tag_appeared_in_title_count = 0.02, which is very weak.
## [1] 0.1904766
For the views & trend_day_count data up to the 95th percentile, we can say that the mean views count increases rapidly as the mean trend_day_count increases.
Correlation between views & trend_day_count = 0.19
## [1] 0.3594205
The rank correlation (method = “spearman”) between views & trend_day_count is 0.36.
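The gap between the Pearson (0.19) and Spearman (0.36) values is typical for skewed data with a monotonic but non-linear relationship. A minimal sketch on hypothetical toy data illustrates the effect. Only base R's `cor()` is used.

```r
# Hypothetical toy data: y grows monotonically but very non-linearly with x.
x <- 1:10
y <- exp(x)

pearson <- cor(x, y)                        # sensitive to the extreme values
spearman <- cor(x, y, method = "spearman")  # based on ranks only

print(c(pearson = pearson, spearman = spearman))
```

Since the ranks of `x` and `y` agree perfectly here, the Spearman coefficient is 1, while the Pearson coefficient stays below 1.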
Relationship between video category_id and views count:
## [1] -0.116636
The correlation between views & category_id is -0.116636 (-0.12).
# Correlation between views & category_id, restricted to views at or below
# the 95th percentile.
# category_id is a factor, so it has to be converted to numeric for cor().
# Note that as.numeric() on a factor returns the level index, not the
# YouTube category id itself.
with(subset(YtUsa,views <= quantile(views,0.95)),
cor(views,as.numeric(category_id)))
## [1] -0.1038592
Correlation between views (up to the 95th percentile) & category_id = -0.1038592
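Several plots in this report restrict the data in the same way, so the filter could be factored into a small helper. This is only a sketch: `keep_up_to` and the toy data frame are hypothetical names and values, not part of the original analysis.

```r
# Keep the rows whose value in `column` is at or below the p-th quantile.
keep_up_to <- function(data, column, p = 0.95) {
  cutoff <- quantile(data[[column]], p, na.rm = TRUE)
  keep <- !is.na(data[[column]]) & data[[column]] <= cutoff
  data[keep, , drop = FALSE]
}

# Hypothetical toy data with one extreme value.
toy <- data.frame(views = c(1:19, 1000))
trimmed <- keep_up_to(toy, "views")
print(nrow(trimmed))
```

The extreme value 1000 lies above the 95th percentile of the toy data and is dropped, leaving 19 rows.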
To make more sense of the distribution across categories, let's create a new variable called “subscriber_by_category”, which splits the categories into 3 groups.
low: all category_ids whose largest channel has a small (<= 7552015) number of subscribers.
medium: all category_ids whose largest channel has a moderate (from 7552016 to 18185017) number of subscribers.
high: all category_ids whose largest channel has a huge (> 18185017) number of subscribers.
Here, low < medium < high
library(dplyr)
# subscriber has NA's, so we work on the subset without them,
# grouped by category_id
cat_groups <- group_by(subset(YtUsa,!is.na(subscriber)),category_id)
YtUsa.subs_by_cat <- summarise(cat_groups,subs_max = max(subscriber),n=n())
YtUsa.subs_by_cat <- arrange(YtUsa.subs_by_cat,subs_max)
# YtUsa.subs_by_cat is a new data.frame containing 3 columns:
# category_id, subs_max, n
# subs_max is the highest subscriber count of any video channel in that
# YouTube category_id
# n is the number of YouTube trending videos in the category.
YtUsa$subscriber_by_category <- NA # new variable created & initialised to
# NA for every row.
# now assign the group label of each category from YtUsa.subs_by_cat
YtUsa[YtUsa$category_id %in%
YtUsa.subs_by_cat[YtUsa.subs_by_cat$subs_max<=7552015,]
$category_id,]$subscriber_by_category <- "low"
YtUsa[YtUsa$category_id %in%
YtUsa.subs_by_cat[YtUsa.subs_by_cat$subs_max> 7552015 &
YtUsa.subs_by_cat$subs_max<=18185017,]
$category_id,]$subscriber_by_category <- "medium"
YtUsa[YtUsa$category_id %in%
YtUsa.subs_by_cat[YtUsa.subs_by_cat$subs_max>18185017,]
$category_id,]$subscriber_by_category <- "high"
# convert to an ordered factor with low < medium < high
YtUsa$subscriber_by_category <- factor(YtUsa$subscriber_by_category,
levels=c("low","medium","high"),ordered=TRUE)
# cor() needs numeric input, so the ordered levels low, medium, high are
# converted to the numbers 1, 2, 3.
with(YtUsa,cor(views,as.numeric(subscriber_by_category)))
## [1] 0.09658316
The correlation between views & subscriber_by_category = 0.10, a weak positive relationship.
So categories with a higher subscriber level tend to get somewhat more YouTube views.
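The same grouping could also be expressed with base R's `tapply()` and `cut()`. The sketch below uses the thresholds from the text but a hypothetical toy data frame; it is an alternative formulation, not the code that produced the results in this report.

```r
# Hypothetical toy data: three categories, one missing subscriber value.
toy <- data.frame(
  category_id = factor(c("1", "1", "10", "10", "24", "24")),
  subscriber  = c(100, 7552015, 8000000, NA, 20000000, 500)
)

# Largest subscriber count per category, ignoring missing values.
subs_max <- tapply(toy$subscriber, toy$category_id, max, na.rm = TRUE)

# (-Inf, 7552015] -> low, (7552015, 18185017] -> medium, above -> high.
level_by_cat <- cut(subs_max,
                    breaks = c(-Inf, 7552015, 18185017, Inf),
                    labels = c("low", "medium", "high"),
                    ordered_result = TRUE)

# Map each video's category back to the level of that category.
idx <- match(as.character(toy$category_id), names(subs_max))
toy$subscriber_by_category <- level_by_cat[idx]
print(toy)
```

`cut()` uses right-closed intervals by default, which matches the `<=` comparisons used for the thresholds above.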
## YtUsa$subscriber_by_category: low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 559 33768 144773 521240 472936 25244097
## --------------------------------------------------------
## YtUsa$subscriber_by_category: medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 773 83726 287705 925453 808356 56111957
## --------------------------------------------------------
## YtUsa$subscriber_by_category: high
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 704 125034 406864 1676927 1315056 149376127
It can be seen that the 1st quartile, median & 3rd quartile of the views count all increase from low to medium to high ‘subscriber_by_category’.
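When only one statistic per group is needed, `tapply()` gives a more compact view than a full summary per group. A minimal sketch on hypothetical toy data:

```r
# Hypothetical toy data with an ordered grouping factor.
toy <- data.frame(
  views = c(10, 20, 30, 100, 200, 1000, 3000),
  subscriber_by_category = factor(
    c("low", "low", "low", "medium", "medium", "high", "high"),
    levels = c("low", "medium", "high"), ordered = TRUE)
)

# Median views per group, in the order of the factor levels.
medians <- tapply(toy$views, toy$subscriber_by_category, median)
print(medians)
```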
## [1] 0.4602942
Likes & dislikes have a moderate positive relationship, with a correlation of 0.46.
##
## Pearson's product-moment correlation
##
## data: views and comment_count
## t = 47.182, df = 4545, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5535431 0.5925758
## sample estimates:
## cor
## 0.5733848
The Pearson correlation between views & comment_count is moderately strong, with a value of 0.5733848.
##
## Spearman's rank correlation rho
##
## data: views and comment_count
## S = 2738300000, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.8252309
rho is a nonparametric measure of rank correlation. The rho between views & comment_count is very strong too, with a value of 0.8252309.
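Both test outputs above have the shape produced by `cor.test()`. A minimal sketch of the two calls on hypothetical toy vectors (the real analysis presumably passes the `views` and `comment_count` columns):

```r
# Hypothetical toy data: one extreme video dominates the linear fit.
toy_views    <- c(100, 400, 900, 1600, 2500, 100000)
toy_comments <- c(1, 3, 2, 8, 9, 50)

pearson_test  <- cor.test(toy_views, toy_comments, method = "pearson")
spearman_test <- cor.test(toy_views, toy_comments, method = "spearman")

# The estimate is named "cor" for Pearson and "rho" for Spearman.
print(pearson_test$estimate)
print(spearman_test$estimate)
```

With real data containing ties, the Spearman test warns that it cannot compute an exact p-value; passing `exact = FALSE` requests the approximation explicitly.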
Regression Line for views v/s trend_tag_highest is monotonic here.
## [1] -0.01307464
The correlation between views & trend_tag_highest is -0.013 here.
## [1] -0.02185687
The correlation between views & trend_tag_total is -0.022, and the relationship is non-linear.
Now boxplots of views v/s trend_day_count, restricted to the central 95% of the data (two-tailed):
## YtUsa$trend_day_count: 1
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 748 41278 163307 426914 466968 9632678
## --------------------------------------------------------
## YtUsa$trend_day_count: 2
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 704 47412 185620 510526 540362 14161833
## --------------------------------------------------------
## YtUsa$trend_day_count: 3
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 559 34282 146532 668972 666358 43449654
## --------------------------------------------------------
## YtUsa$trend_day_count: 4
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 988 55959 204197 828470 608572 21582276
## --------------------------------------------------------
## YtUsa$trend_day_count: 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1464 79040 268343 993644 853708 26448434
## --------------------------------------------------------
## YtUsa$trend_day_count: 6
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1949 149819 372212 928398 866277 41088994
## --------------------------------------------------------
## YtUsa$trend_day_count: 7
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4318 271626 667336 2147018 1641607 57951412
## --------------------------------------------------------
## YtUsa$trend_day_count: 8
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3170 235848 639536 2611971 2025050 149376127
## --------------------------------------------------------
## YtUsa$trend_day_count: 9
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6688 448169 1115398 3050553 2448968 91933007
## --------------------------------------------------------
## YtUsa$trend_day_count: 10
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 24296 479238 1520884 7000378 4553615 102012605
## --------------------------------------------------------
## YtUsa$trend_day_count: 11
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16425 188322 485534 2535995 1891032 34269048
## --------------------------------------------------------
## YtUsa$trend_day_count: 12
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 68886 345371 1046764 1465997 2090906 7721222
## --------------------------------------------------------
## YtUsa$trend_day_count: 13
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 451602 1685408 2888654 7769948 5305317 45938392
## --------------------------------------------------------
## YtUsa$trend_day_count: 14
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17540613 17540613 17540613 17540613 17540613 17540613
Median views count for trend_day_count = 1 is 163307. Median views count for trend_day_count = 14 is 17540613 (a single video). And 17540613 >> 163307. So the more times a video is listed on YouTube trending, the higher its median views count tends to be (in most cases).
## subscriber_by_category: low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 230 1011 5315 5930 97030
## --------------------------------------------------------
## subscriber_by_category: medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2264 8843 27730 25045 1988746
## --------------------------------------------------------
## subscriber_by_category: high
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2015 9124 55122 33120 3093544
Median likes count for low group of ‘subscriber_by_category’ is far lower than the median likes count for medium & high.
## subscriber_by_category: low
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 163289 599310 1132174 1685948 7552015 2
## --------------------------------------------------------
## subscriber_by_category: medium
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 214706 1099202 2305559 3008137 18185017 9
## --------------------------------------------------------
## subscriber_by_category: high
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 335046 1759496 4238331 5292034 28676937 11
Median subscriber for subscriber_by_category: low = 599310 . Median subscriber for subscriber_by_category: medium = 1099202 . Median subscriber for subscriber_by_category: high = 1759496 .
Boxplots of views v/s tag_appeared_in_title, restricted to the central 95% of the data (two-tailed), are shown below:
## YtUsa$tag_appeared_in_title: FALSE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 704 53227 247049 1167267 795873 56111957
## --------------------------------------------------------
## YtUsa$tag_appeared_in_title: TRUE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 559 97816 333984 1283600 1042283 149376127
Median views for tag_appeared_in_title = FALSE: 247049. Median views for tag_appeared_in_title = TRUE: 333984.
From the above observations we can say that whether a tag appears in the title has an impact on the views count of YouTube trending videos.
The above plot shows an obvious point: if tag_appeared_in_title is FALSE, then tag_appeared_in_title_count must be 0 (since one variable is derived from the other).
For a trending YouTube video, if the difference between the first trending date & the publish date is less than 4 days, it is not re-trended more than 3 times on YouTube.
## YtUsa$comments_disabled: False
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 559 92536 320792 1258650 1010413 149376127
## --------------------------------------------------------
## YtUsa$comments_disabled: True
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 748 23922 149300 1678329 862810 56111957
Although only 76 of the 4547 observations have comments_disabled = True, we can still see a large difference in views: the median views count is 149300 when comments are disabled versus 320792 otherwise.
It looks like the likes/dislikes ratio is not uniform across YouTube trending videos.
It looks like the most frequently used tags are attached to videos from category_id 23 & 24.
A significant number of the videos re-trended every day come from the subscriber_by_category: high & subscriber_by_category: medium groups.
## YtUsa$tag_appeared_in_title: FALSE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 5.000 4.655 7.000 13.000
## --------------------------------------------------------
## YtUsa$tag_appeared_in_title: TRUE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 5.000 4.863 7.000 14.000
It looks like the number of days a video trended is not much affected by whether a tag appeared in the video title.
## [1] 0.003732712
Just from the number of tags (tags_count) attached to a trending video, we cannot say how many views that video would get.
There are a good number of outliers in the dataset. These outliers have only a few subscribers and yet they managed to get a high views count.
Applying geom_smooth() with & without the linear method (lm), it looks like the average views count increases with the average number of subscribers, though the relationship is not strong. [Note: only the central 95% of the data (two-tailed) is considered.]
## [1] 0.2657179
As expected, the correlation coefficient between views & subscriber is 0.27.
Now I am building a linear model for views v/s likes:
##
## Call:
## lm(formula = I(views) ~ I(likes), data = subset(YtUsa, !is.na(subscriber) &
## !is.na(tags)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -47329618 -245824 -185412 31998 68253405
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.275e+05 4.100e+04 5.55 3.03e-08 ***
## I(likes) 2.615e+01 2.719e-01 96.17 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2596000 on 4321 degrees of freedom
## Multiple R-squared: 0.6816, Adjusted R-squared: 0.6815
## F-statistic: 9248 on 1 and 4321 DF, p-value: < 2.2e-16
For the linear model ,Multiple R-squared value between views & likes = 0.6816
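For a linear model with a single predictor, R-squared equals the squared Pearson correlation of the two variables on the same rows. A minimal sketch on hypothetical toy data shows how the value can be read from the fitted model:

```r
# Hypothetical toy data, not the real likes and views columns.
toy <- data.frame(
  likes = c(1, 2, 3, 4, 5, 6),
  views = c(30, 55, 70, 120, 130, 170)
)

fit <- lm(views ~ likes, data = toy)
r_squared <- summary(fit)$r.squared

# With one predictor, R-squared is the square of the Pearson correlation.
print(c(r_squared = r_squared, cor_squared = cor(toy$views, toy$likes)^2))
```

In the report the model is fitted on the subset without missing `subscriber` and `tags` values, so its R-squared (0.6816) need not match the square of the correlation computed on all rows.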
The views counts of YouTube trending videos are strongly correlated with likes, dislikes and comment_count. As likes, dislikes or comment_count increase, the views of a video also increase. The relationship between views count and likes, dislikes or comment_count is almost linear.
Pearson correlation coefficient between views & likes = 0.82
Pearson correlation coefficient between views & dislikes = 0.53
Pearson correlation coefficient between views & comment_count = 0.57
Spearman rank correlation between views & comment_count = 0.8252309
Note that all of the above correlations are positive and moderate to strong.
Pearson correlation coefficient between views & subscriber = 0.27
For tag_appeared_in_title: TRUE , median views count is 333984. For tag_appeared_in_title: FALSE , median views count is 247049.
Median subscriber for subscriber_by_category: low = 599310 . Median subscriber for subscriber_by_category: medium = 1099202 . Median subscriber for subscriber_by_category: high = 1759496 .
The most frequently used tags are attached to videos from category_id 23 & 24.
Based on the R^2 value, likes explains about 68 percent of the variance in views. Other features of interest can be incorporated into the model to explain more of the variance in views.
Yes. The Pearson correlation coefficient (PCC) between views & category_id is -0.12, between views & tag_appeared_in_title = 0.01, between views & trend_tag_highest = -0.01, between views & trend_tag_total = -0.02, between views & trend_day_count = 0.19, and between views & subscriber_by_category = 0.10.
The 1st quartile, median & 3rd quartile of the views count all increase with the ‘subscriber_by_category’ level.
The relationship between views count and category_id, trend_tag_highest or trend_day_count is monotonic. The relationship between views count and trend_tag_total is non-linear.
The strongest relationship involving views is between views & likes, with a Pearson correlation coefficient of 0.82.
The trend shows that the variance of views per day grows for groups with higher subscriber levels.
Just from how many times a video appeared on YouTube trending, we cannot predict how many views it gets, because the variance of views for each day count is very large, so the range is very wide.
## subscriber_by_category: low
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 186 9957 36099 126459 124612 5048819
## --------------------------------------------------------
## subscriber_by_category: medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 247 21370 62628 182627 173469 6769490
## --------------------------------------------------------
## subscriber_by_category: high
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 248 31536 107318 357914 328971 18672016
From the above observation we can say that subscriber_by_category: high has the highest median views per day (107318).
## tag_appeared_in_title: FALSE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 247 14459 55604 246723 198642 6769490
## --------------------------------------------------------
## tag_appeared_in_title: TRUE
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 186 25111 82070 270246 237691 18672016
The median views per day for tag_appeared_in_title: TRUE is higher than the median views per day for tag_appeared_in_title: FALSE.
On a trending video, comments_disabled or ratings_disabled has an impact on the number of views per day.
## [1] 0.04160659
The correlation between (views/trend_day_count) & trend_tag_highest is 0.042, slightly higher than the correlation between views & trend_tag_highest (-0.013) calculated in the Bivariate Plots section.
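The "views per day" quantity used in this section is views divided by trend_day_count. A minimal sketch of deriving it as a column, using a hypothetical toy data frame and the illustrative column name `views_per_day`:

```r
# Hypothetical toy data, not the real columns.
toy <- data.frame(
  views = c(1000, 9000, 500),
  trend_day_count = c(1, 3, 5)
)

# Average views per trending day for each video.
toy$views_per_day <- toy$views / toy$trend_day_count
print(toy)
```

Since trend_day_count has a minimum of 1 in this dataset, the division never hits zero.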
The above plot shows views v/s likes (data up to the 95th percentile), coloured by trend_day_count.
The above plot shows views v/s likes (data up to the 95th percentile), coloured by the 3 subscriber_by_category groups.
Across the plot we can see a clear dominance of the medium & high groups among YouTube trending videos with high views counts & high likes.
The above plot shows that tag_appeared_in_title has an impact on views v/s trend.publish.diff (up to the 95th percentile).
The above plot shows views (up to the 95th percentile) v/s subscriber_by_category, coloured by trend_day_count.
The above plot shows views per day (up to the 95th percentile) v/s subscriber_by_category, coloured by whether a tag appeared in the title.
Now I am plotting the same plot, but instead of views per day, this time I am using the total views of a video.
The last 2 plots look similar, but take a close look at the y-axis (views) scale. Its values are much larger than in the previous plot, because it shows the total views of a video_id, not the average views per day.
Looking closely at the bubble sizes, we can observe that trending videos listed more than 5 times got the highest number of views.
There are multiple outliers in the plot: many trending videos got a huge number of views although their subscriber counts are very low. Let's focus on 0 to 1000 subscribers and apply facet_wrap for the comments_disabled & ratings_disabled variables.
So we can see that, besides the videos with ratings_disabled or comments_disabled set to True, there are many other outliers with subscriber = 0 and a huge number of views. I don’t know the real reason behind this. There might be some lurking variables causing it.
## [1] 0.2693141
PCC between views & subscriber for tag_appeared_in_title : True = 0.27 .
## [1] 0.2444701
PCC between views & subscriber for tag_appeared_in_title : False = 0.24 .
From the last plot and the above observations, we can say that the correlation between views & subscriber is slightly stronger when a tag appears in the title (0.27) than when no tag appears in the title (0.24).
In other words, if the title of a trending video does not contain any of its tags, its subscriber and views counts might also be affected (lower).
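Because the two coefficients (0.27 and 0.24) are close, it may be worth checking whether their difference is larger than sampling noise. One common option, which is not part of the original analysis, is a Fisher z-test for two independent correlations. A minimal sketch; the helper `fisher_z_diff` and the group sizes passed to it are hypothetical placeholders, since the per-group counts are not shown in this report:

```r
# Fisher z-test for the difference between two independent Pearson correlations.
# r1, r2: correlation coefficients; n1, n2: number of observations in each group.
fisher_z_diff <- function(r1, n1, r2, n2) {
  # atanh() is the Fisher z-transformation of a correlation coefficient.
  z <- (atanh(r1) - atanh(r2)) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
  p <- 2 * pnorm(-abs(z))  # two-sided p-value
  c(z = z, p_value = p)
}

# Correlations are taken from the output above;
# the group sizes (3600 and 700) are ASSUMED values for illustration only.
res <- fisher_z_diff(r1 = 0.2693141, n1 = 3600, r2 = 0.2444701, n2 = 700)
print(res)
```

With the real group sizes substituted in, a large p-value would suggest the two correlations are not distinguishable.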
From the above plot (I considered a 2-tailed 95% CI), we can see that there are a few categories where the variance of views per day & subscriber is very large. The categories are: 10, 23, 24.
We can also say that the channel with the highest number of subscribers among Youtube trending videos belongs to category_id: 23.
From the above plot (I considered a 2-tailed 95% CI), we can say that videos in the categories with the highest levels of subscribers use at least one of their tags in the trending video title.
Let’s print out the model table :-
##
## Calls:
## m1: lm(formula = I(views) ~ I(likes), data = subset(YtUsa, !is.na(subscriber) &
## !is.na(tags)))
## m2: lm(formula = I(views) ~ I(likes) + comment_count, data = subset(YtUsa,
## !is.na(subscriber) & !is.na(tags)))
## m3: lm(formula = I(views) ~ I(likes) + comment_count + dislikes,
## data = subset(YtUsa, !is.na(subscriber) & !is.na(tags)))
## m4: lm(formula = I(views) ~ I(likes) + comment_count + dislikes +
## trend_day_count, data = subset(YtUsa, !is.na(subscriber) &
## !is.na(tags)))
## m5: lm(formula = I(views) ~ I(likes) + comment_count + dislikes +
## trend_day_count + category_id, data = subset(YtUsa, !is.na(subscriber) &
## !is.na(tags)))
## m6: lm(formula = I(views) ~ I(likes) + comment_count + dislikes +
## trend_day_count + category_id + tag_appeared_in_title, data = subset(YtUsa,
## !is.na(subscriber) & !is.na(tags)))
## m7: lm(formula = I(views) ~ I(likes) + comment_count + dislikes +
## trend_day_count + category_id + tag_appeared_in_title + subscriber,
## data = subset(YtUsa, !is.na(subscriber) & !is.na(tags)))
##
## ================================================================================================================================================================================================
##                         m1                      m2                      m3                      m4                      m5                      m6                      m7
## ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## (Intercept)             227538.928***           223254.443***           272224.491***           -140074.259*            453376.011**            680180.046***           682721.882***
##                         (40997.972)             (41041.876)             (33945.316)             (68312.277)             (162477.320)            (185747.621)            (186304.987)
## I(likes)                26.150***               26.694***               31.970***               31.626***               32.124***               32.148***               32.165***
##                         (0.272)                 (0.388)                 (0.341)                 (0.343)                 (0.350)                 (0.350)                 (0.363)
## comment_count                                   -3.481*                 -94.436***              -93.792***              -94.771***              -94.994***              -95.023***
##                                                 (1.766)                 (2.502)                 (2.491)                 (2.486)                 (2.486)                 (2.491)
## dislikes                                                                75.130***               74.996***               74.928***               74.994***               75.006***
##                                                                         (1.679)                 (1.670)                 (1.656)                 (1.656)                 (1.657)
## trend_day_count                                                                                 87385.740***            93747.348***            94321.998***            94142.424***
##                                                                                                 (12586.797)             (12657.862)             (12652.105)             (12692.833)
## category_id: 2                                                                                                          343339.268              339255.564              337885.111
##                                                                                                                         (300565.657)            (300384.327)            (300514.672)
## category_id: 10                                                                                                         -1150715.770***         -1138430.648***         -1137304.134***
##                                                                                                                         (173793.532)            (173754.849)            (173887.165)
## category_id: 15                                                                                                         -672396.419**           -672963.888**           -672860.020**
##                                                                                                                         (250231.202)            (250076.686)            (250105.481)
## category_id: 17                                                                                                         -330136.845             -320163.924             -316810.857
##                                                                                                                         (192873.103)            (192794.727)            (193715.043)
## category_id: 19                                                                                                         -580428.071             -586212.922             -587315.148
##                                                                                                                         (344618.305)            (344413.049)            (344506.268)
## category_id: 20                                                                                                         -579426.034             -606973.159             -605818.068
##                                                                                                                         (330401.999)            (330379.534)            (330479.067)
## category_id: 22                                                                                                         -764477.273***          -778314.746***          -777219.480***
##                                                                                                                         (191293.348)            (191254.337)            (191372.705)
## category_id: 23                                                                                                         -1022817.922***         -1032406.476***         -1027304.877***
##                                                                                                                         (183815.371)            (183741.367)            (185936.967)
## category_id: 24                                                                                                         -440257.056**           -438662.353**           -436462.550**
##                                                                                                                         (161040.744)            (160942.486)            (161424.329)
## category_id: 25                                                                                                         -609647.157***          -628134.362***          -628862.425***
##                                                                                                                         (180841.552)            (180879.298)            (180944.902)
## category_id: 26                                                                                                         -657792.027***          -651797.293***          -651178.657***
##                                                                                                                         (180466.315)            (180370.562)            (180423.620)
## category_id: 27                                                                                                         -867162.101***          -877987.638***          -876734.028***
##                                                                                                                         (219525.715)            (219432.309)            (219567.582)
## category_id: 28                                                                                                         -589269.122**           -590792.585**           -590813.534**
##                                                                                                                         (195438.361)            (195318.539)            (195340.544)
## category_id: 29                                                                                                         -2281129.093***         -2311105.192***         -2313453.198***
##                                                                                                                         (631008.269)            (630731.041)            (630936.980)
## category_id: 43                                                                                                         -1078145.794            -1052072.679            -1051764.831
##                                                                                                                         (1501807.627)           (1500915.476)           (1501085.278)
## tag_appeared_in_title                                                                                                                           -257303.258*            -256639.079*
##                                                                                                                                                 (102328.747)            (102406.820)
## subscriber                                                                                                                                                              -0.001
##                                                                                                                                                                         (0.007)
## ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
## R-squared               0.682                   0.682                   0.783                   0.785                   0.790                   0.790                   0.790
## adj. R-squared          0.681                   0.682                   0.782                   0.785                   0.789                   0.789                   0.789
## sigma                   2596319.808             2595453.197             2145557.086             2133928.400             2113155.980             2111850.258             2112087.803
## F                       9248.460                4629.261                5183.681                3942.298                851.634                 810.369                 771.608
## p                       0.000                   0.000                   0.000                   0.000                   0.000                   0.000                   0.000
## Log-likelihood          -69982.076              -69980.132              -69156.697              -69132.703              -69082.893              -69079.719              -69079.703
## Deviance                29127327557805164.000   29101149937090996.000   19882150280982512.000   19662662495539220.000   19214737528568204.000   19186539321071680.000   19186394929287012.000
## AIC                     139970.152              139968.265              138323.395              138277.406              138207.787              138203.438              138205.405
## BIC                     139989.267              139993.751              138355.253              138315.636              138341.593              138343.616              138351.955
## N                       4323                    4323                    4323                    4323                    4323                    4323                    4323
## ================================================================================================================================================================================================
Now the summary of the final model :-
##
## Call:
## lm(formula = I(views) ~ I(likes) + comment_count + dislikes +
## trend_day_count + category_id + tag_appeared_in_title + subscriber,
## data = subset(YtUsa, !is.na(subscriber) & !is.na(tags)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -22887876 -440250 -116257 195134 55268218
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.827e+05 1.863e+05 3.665 0.000251 ***
## I(likes) 3.217e+01 3.629e-01 88.627 < 2e-16 ***
## comment_count -9.502e+01 2.491e+00 -38.141 < 2e-16 ***
## dislikes 7.501e+01 1.657e+00 45.263 < 2e-16 ***
## trend_day_count 9.414e+04 1.269e+04 7.417 1.44e-13 ***
## category_id2 3.379e+05 3.005e+05 1.124 0.260925
## category_id10 -1.137e+06 1.739e+05 -6.540 6.85e-11 ***
## category_id15 -6.729e+05 2.501e+05 -2.690 0.007166 **
## category_id17 -3.168e+05 1.937e+05 -1.635 0.102028
## category_id19 -5.873e+05 3.445e+05 -1.705 0.088303 .
## category_id20 -6.058e+05 3.305e+05 -1.833 0.066849 .
## category_id22 -7.772e+05 1.914e+05 -4.061 4.97e-05 ***
## category_id23 -1.027e+06 1.859e+05 -5.525 3.49e-08 ***
## category_id24 -4.365e+05 1.614e+05 -2.704 0.006882 **
## category_id25 -6.289e+05 1.809e+05 -3.475 0.000515 ***
## category_id26 -6.512e+05 1.804e+05 -3.609 0.000311 ***
## category_id27 -8.767e+05 2.196e+05 -3.993 6.63e-05 ***
## category_id28 -5.908e+05 1.953e+05 -3.025 0.002505 **
## category_id29 -2.313e+06 6.309e+05 -3.667 0.000249 ***
## category_id43 -1.052e+06 1.501e+06 -0.701 0.483547
## tag_appeared_in_titleTRUE -2.566e+05 1.024e+05 -2.506 0.012245 *
## subscriber -1.301e-03 7.232e-03 -0.180 0.857230
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2112000 on 4301 degrees of freedom
## Multiple R-squared: 0.7902, Adjusted R-squared: 0.7892
## F-statistic: 771.6 on 21 and 4301 DF, p-value: < 2.2e-16
The predictors of the final model collectively explain only about 79 percent of the variance in views (R^2 = 0.7902). Residual standard error: 2112000 on 4301 degrees of freedom.
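In the model table above, adding subscriber (m7) leaves R-squared at 0.790, raises AIC slightly compared with m6 (138205.405 vs. 138203.438), and the subscriber coefficient is not significant (p = 0.857). One way to check whether an extra predictor is worth keeping is to compare the nested models directly with AIC() and anova(). A minimal sketch on simulated data; the data frame `toy` and the coefficients used to generate it are invented for illustration:

```r
set.seed(7)
n <- 300

# Simulated stand-in data: views depend on likes, but NOT on subscriber.
toy <- data.frame(
  likes      = runif(n, 0, 1e5),
  subscriber = runif(n, 0, 1e6)
)
toy$views <- 2e5 + 30 * toy$likes + rnorm(n, sd = 5e5)

# Nested models: the larger one adds subscriber to the smaller one.
m_small <- lm(views ~ likes, data = toy)
m_big   <- update(m_small, . ~ . + subscriber)

# Lower AIC is better; AIC penalises the extra parameter.
aic_tab <- AIC(m_small, m_big)
print(aic_tab)

# F-test for whether the added term reduces residual variance significantly.
f_test <- anova(m_small, m_big)
print(f_test)
```

If the larger model has the higher AIC and the F-test is not significant, the simpler model is usually preferred.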
I observed that, just from how many times a video appeared in the Youtube trending list, we cannot tell how many views the video will get, because the variance of views for each day is very large.
Median views per day for subscriber_by_category: low = 36099 . Median views per day for subscriber_by_category: medium = 62628 . Median views per day for subscriber_by_category: high = 107318 .
Median views per day for tag_appeared_in_title : True is 82070 . Median views per day for tag_appeared_in_title : False is 55604 .
The channel with the highest number of subscribers among Youtube trending videos belongs to category_id = 23 .
From the top 95% of data points of views per day & subscriber, we observed that there are a few categories where the variance of views & subscriber is very large. The categories are : 10, 23, 24 .
From the plots, I saw that many videos have 0 likes but a high number of views, and many videos have zero (0) subscribers but still manage to get a high number of views; those are probably the outliers of the dataset. To isolate them, I focused on 0 to 1000 subscribers and applied facet_wrap on the comments_disabled & ratings_disabled variables. Surprisingly, comments_disabled = TRUE & ratings_disabled = TRUE could not account for all of the outliers. There must be something else going on; maybe some lurking variables are causing this behaviour.
Trending videos listed more than 5 times (or days) got the highest views counts. The Pearson correlation coefficient (PCC) between views & subscriber is slightly stronger when one of the video's tags appears in the title. The variance of views per day gets larger for higher subscriber groups (low < medium < high for subscriber_by_category).
Yes. Many of the trending videos have a low number of subscribers, yet they managed to get more views than the channels with the most subscribers. I also saw many trending videos with high views counts but very few likes; many of them have 0 likes, and I think some of the 0-like videos are videos where ratings_disabled is set to True.
Yes. For the final model (m7), I got Multiple R-squared: 0.7902 & Residual standard error: 2112000 . So the final model explains only about 79 percent of the variance in views. Its residual standard error is also very large, which would lead to wide confidence intervals around the model's predictions. Therefore this model would not be able to predict views counts accurately.
The model I am looking for is not just for predicting the views count of a trending video; I also want to flag a video if it falls outside the 95% CI.
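The flagging idea can be sketched with predict() on an lm fit. This sketch assumes that "95% CI" here means a 95% prediction interval for an individual video (an interpretation, not something the report states), and it uses simulated data in a hypothetical data frame `toy`:

```r
set.seed(42)
n <- 200

# Simulated stand-in data; the generating coefficients are invented.
toy <- data.frame(likes = round(runif(n, 0, 1e5)))
toy$views <- 2e5 + 30 * toy$likes + rnorm(n, sd = 3e5)

fit <- lm(views ~ likes, data = toy)

# 95% prediction interval for each video: columns "fit", "lwr", "upr".
pred <- predict(fit, newdata = toy, interval = "prediction", level = 0.95)

# Flag videos whose observed views fall outside their interval.
toy$flag <- toy$views < pred[, "lwr"] | toy$views > pred[, "upr"]
print(table(toy$flag))
```

With a residual standard error as large as the one reported for the final model, such intervals would be very wide, so only extreme videos would be flagged.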
From the above plot, we can see that no video trended for more than 14 days. We can also see that more than 600 videos appeared in the Youtube trending list only once (that was both the first & last time).
Since log10 is applied on the x-axis and a few videos in the Youtube trending list have 0 likes, we have to plot (likes+1) instead of likes when using scale_x_log10(). That avoids infinite values (since log10(0) = -Inf). Therefore, on the x-axis of the above plot, x=1 represents 0 likes, x=100 represents 99 likes, and so on.
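The effect of the +1 shift can be checked directly in base R; the `likes` vector below is a made-up example:

```r
# Made-up like counts, including a video with 0 likes.
likes <- c(0, 9, 99, 999)

# Without the shift, the 0-like video maps to -Inf and cannot be drawn.
raw <- log10(likes)
print(raw[1])        # -Inf

# With the shift, every video gets a finite position on the log axis.
shifted <- log10(likes + 1)
print(shifted)       # 0 1 2 3
```

The shift barely changes the position of videos with many likes, but it keeps the 0-like videos on the plot.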
I have added two smoother lines to the above plot, one with the linear method (red line) & another without the linear method (blue line). Here the linear smoother is the regression line, i.e. the line of best fit in the scatterplot. There is a very strong relationship between the views & likes attributes (cor=0.82), and the slope of the linear line is close to 1.
We can see that many outliers exist on the y-axis at x = 1 . Many of those video authors might have disabled video ratings, so users are not able to like or dislike the video. Those outliers cause the non-linear regression line to start from (x=1, y=10000).
Note - I took a subset of the dataset to filter out rows where subscriber is NA.
After considering only the top 95% of views & subscribers, we can see that, in every group, trending videos that do not include any of their tags in the video title tend to have fewer subscribers & views than videos that include at least one.
Trending videos under subscriber_by_category: high & subscriber_by_category: medium were able to achieve the highest level of views on Youtube.
From the bubble sizes, we can say that trending videos that trended for more than 3 to 4 times (or days) were able to achieve the highest level of views on Youtube.
From the linear regression lines, we can say that, except for videos under subscriber_by_category: low, as the average number of subscribers increases, the average number of views also increases.
The Youtube Trending data set contains information on almost 4,600 unique videos across 23 variables, recorded over a total of 105 days. [I have also created an extra categorical variable.]
I started by understanding the individual variables in the data set, and then I explored interesting questions and leads as I continued to make observations on plots. Eventually, I explored the views of videos across many variables. The views count of a video is strongly correlated with likes, dislikes and comment_count. The highest correlation exists between the views & likes variables, where cor=0.82 . But there are some other important features: subscriber, trend_day_count, category_id; they have good correlation values. That’s why I included all of the above features to build the predictive model. The model was able to account for only 79% of the variance in views. The model's residual standard error was also large. So the model would not be very useful for predicting the views count of a Youtube trending video accurately.
Right now I am unable to build a good predictive model, but I found some interesting facts about Youtube trending videos. These are :-
84% or more of the trending videos use at least one of their tags in the video title.
Other than 604 trending videos, all trending videos appeared in the trending list more than once.
Most Youtube videos are listed on trending within 0 to 14 days of the video publishing date.
More users engaged in conversation when they were disliking a trending video than when they were liking a trending video.
If the difference between the first trending date & the publish date is less than 4 days, then there is a big chance that the video will not trend more than 3 times.
Whether tag_appeared_in_title is true or not has an impact on the views count of Youtube trending videos.
Trending videos listed more than 5 times got the highest number of views.
Videos in the categories with the most subscribers use at least one of their tags in the trending video title.
Struggle:-
I struggled to create these two variables : trend_tag_highest, trend_tag_total .
I failed to create them from the dataset directly; this might be because I was new to R programming.
But somehow, by creating a separate (temporary) data frame from the ‘tags’ variable and using that temporary data frame, I succeeded in creating the new variables: trend_tag_highest, trend_tag_total . The trick was to build a data structure in R similar to a Python dictionary. I believe there should be an easier technique to achieve the same goal.
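A named vector built from table() behaves much like a Python dictionary keyed by tag, which may be one simpler route. The sketch below is not the author's code. It ASSUMES that tags are separated by "|" (as in the ‘tags’ values shown in the data structure), that trend_tag_total is the sum of the dataset-wide frequencies of a video's tags, and that trend_tag_highest is the largest of those frequencies; the report does not spell out these definitions, and the example tags are invented:

```r
# Invented example values for the 'tags' column, one string per video.
tags <- c("music|pop|live", "news|politics", "pop|live|concert", "[none]")

# Split each string into its individual tags ("|" is matched literally).
tag_list <- strsplit(tags, "|", fixed = TRUE)

# Dictionary-like lookup: tag name -> number of videos' tag lists it occurs in.
tab      <- table(unlist(tag_list))
tag_freq <- setNames(as.vector(tab), names(tab))

# ASSUMED definitions, for illustration only:
# total   = sum of the frequencies of a video's tags,
# highest = frequency of the video's most common tag.
trend_tag_total   <- sapply(tag_list, function(tg) sum(tag_freq[tg]))
trend_tag_highest <- sapply(tag_list, function(tg) max(tag_freq[tg]))

print(data.frame(tags, trend_tag_total, trend_tag_highest))
```

Indexing `tag_freq` by a character vector looks up several keys at once, so no temporary data frame is needed.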
Surprise:-
Many Youtube trending videos were listed on the trending list for more than 1 time (or day), but they did not get a high amount of traffic.
Another point, which I already discussed: many of the trending videos have a low number of subscribers (some of them have 0), yet they managed to get more views than the top-subscribed channels on Youtube. I also saw many trending videos with high views counts but very few likes (many of them have 0).
Future work :-
In the future we could use an OCR (optical character recognition) technique to scan the thumbnail image and check whether the thumbnail uses any text.
We can make a new variable called ‘tag_appeared_in_description’ with the help of the ‘tags’ & ‘description’ variables; it would be very similar to the existing variable ‘tag_appeared_in_title’ . With a similar approach, we can make another variable called ‘title_appeared_in_description’ with the help of the ‘title’ & ‘description’ variables.
We can create some bucket variables (using the cut function) for the ‘last_trending_date’ and ‘tags_count’ variables to make real use of those 2 features.
We can do some element grouping, which might help to extract some hidden information from the data.
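Two of the ideas above, the ‘tag_appeared_in_description’ flag and the cut()-based buckets, can be sketched as follows. All values, the helper `count_tags_in_text`, the break points and the bucket labels are hypothetical choices for illustration; the matching is a plain case-insensitive substring search, which is only one possible definition of "appeared":

```r
# Invented example rows; column names follow the report.
videos <- data.frame(
  tags        = c("cat|funny", "news|election", "diy"),
  description = c("A funny clip of my cat.", "Weather update for today.",
                  "DIY shelf build"),
  stringsAsFactors = FALSE
)

# Count how many of a video's "|"-separated tags occur in a piece of text.
count_tags_in_text <- function(tags, text) {
  tag_vec <- strsplit(tags, "|", fixed = TRUE)[[1]]
  sum(vapply(tag_vec,
             function(tg) grepl(tolower(tg), tolower(text), fixed = TRUE),
             logical(1)))
}

videos$tag_appeared_in_description_count <-
  mapply(count_tags_in_text, videos$tags, videos$description, USE.NAMES = FALSE)
videos$tag_appeared_in_description <- videos$tag_appeared_in_description_count > 0
print(videos[, c("tags", "tag_appeared_in_description")])

# Bucket a numeric variable; breaks and labels are arbitrary examples.
tags_count  <- c(0, 3, 8, 15, 27, 40)
tags_bucket <- cut(tags_count, breaks = c(0, 5, 15, Inf),
                   labels = c("few", "some", "many"), include.lowest = TRUE)
print(tags_bucket)

# Bucket dates by calendar month.
last_trending_date <- as.Date(c("2017-11-14", "2017-12-02", "2018-01-20"))
month_bucket <- cut(last_trending_date, breaks = "month")
print(month_bucket)
```

The same helper could be reused for ‘title_appeared_in_description’ by passing the title in place of the tag string.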
Limitation of the dataset:-
There is a latency (time delay) in the dataset, because the subscriber column was scraped separately (subscriber data did not come with the original dataset, so it was recorded at a different time).
Possible Lurking Variables :-
There are some lurking variables that could affect the main features of the dataset (views, likes, dislikes, comment_count, subscriber). One example is the content of the thumbnail used by the trending video. An auto-generated thumbnail (by the Youtube system) might lead to a lower number of views, likes, etc. On the other hand, if a trending video uses a custom thumbnail that was manually uploaded by the author, and it contains a striking image or text that provokes viewers to click and check the video (in other words, clickbait), that could easily affect the trending statistics.
Alternatively, if a trending video's content matches the trend of the day on the Internet, it could easily get more attention than other trending videos, irrespective of how many subscribers the channel has.
Another possibility is that the video is not well presented: its graphics or sound quality is poor, or its presentation is poor.
It is also possible that some dishonest users are using blackhat techniques to bypass the Youtube trending algorithm, and Youtube's algorithms have not flagged them yet.